
[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063


Draft
chraac wants to merge 205 commits into master

Conversation


@chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and carries out further refactoring and improvements to introduce Qualcomm QNN backend support in GGML.

This backend is organized into three distinct integration layers:

graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance (a simplified sketch of the caching idea appears after this list).
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class.
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path.
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
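
As a rough illustration of the caching strategy described above (not the actual implementation in backend-ops.cpp), a cache can key QNN graphs on a signature derived from the GGML graph topology, so repeated evaluations of the same graph reuse a previously built QNN graph. The names qnn_graph, build_qnn_graph, and make_graph_key below are hypothetical, and the sketch assumes the graph node list is visible to backend code via ggml-impl.h, as it is for the other ggml backends.

```cpp
// Hypothetical sketch: reuse a previously built QNN graph when a GGML graph
// with the same topology is evaluated again, instead of rebuilding it.
#include <memory>
#include <string>
#include <unordered_map>

#include "ggml.h"
#include "ggml-impl.h"  // ggml_cgraph definition used by backend code

struct qnn_graph;  // stand-in for a wrapper around a built QNN graph handle

// Build a key from op types and tensor shapes; identical keys mean the same
// QNN graph can be reused.
static std::string make_graph_key(const ggml_cgraph * cgraph) {
    std::string key;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const ggml_tensor * node = cgraph->nodes[i];
        key += ggml_op_name(node->op);
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            key += '_';
            key += std::to_string(node->ne[d]);
        }
        key += ';';
    }
    return key;
}

struct qnn_graph_cache {
    std::unordered_map<std::string, std::shared_ptr<qnn_graph>> graphs;

    // Hypothetical builder that lowers the GGML graph into a QNN graph.
    std::shared_ptr<qnn_graph> build_qnn_graph(const ggml_cgraph * cgraph);

    std::shared_ptr<qnn_graph> get_or_build(const ggml_cgraph * cgraph) {
        const std::string key = make_graph_key(cgraph);
        auto it = graphs.find(key);
        if (it != graphs.end()) {
            return it->second;  // cache hit: skip graph re-creation
        }
        auto graph = build_qnn_graph(cgraph);
        graphs.emplace(key, graph);
        return graph;
    }
};
```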

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance.
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations (a simplified op-mapping sketch appears after this list).
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
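
To make the op-translation point above concrete, the sketch below shows the general shape of an op-capability table like the one in op-config-caps.cpp: a GGML op either maps to a QNN op name or is rejected so that ggml schedules it on another backend. The QNN op name strings and the support check are placeholders, not the PR's actual tables.

```cpp
// Placeholder mapping in the spirit of op-config-caps.cpp: GGML ops with a
// QNN counterpart return its op name, everything else is left to other
// backends. The QNN op name strings below are illustrative only.
#include "ggml.h"

static const char * to_qnn_op_name(enum ggml_op op) {
    switch (op) {
        case GGML_OP_ADD:     return "ElementWiseAdd";       // placeholder
        case GGML_OP_MUL:     return "ElementWiseMultiply";  // placeholder
        case GGML_OP_MUL_MAT: return "MatMul";               // placeholder
        default:              return nullptr;                // not offloaded
    }
}

// Simplified support check: the op must have a QNN counterpart and use a
// data type the backend currently handles (float types in this sketch).
static bool device_supports_op(const ggml_tensor * op) {
    if (to_qnn_op_name(op->op) == nullptr) {
        return false;
    }
    return op->type == GGML_TYPE_F32 || op->type == GGML_TYPE_F16;
}
```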

Build

For build instructions, please refer to this page.

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows using test-backend-ops; this check runs in the pipeline for each commit of the dev-refactoring branch.

    | Platform | test-backend-ops | Full console output |
    | --- | --- | --- |
    | Android | 2ac8fce111ee0047a5a8b43808047ff2 (screenshot) | test-backend-ops_all_android_ff033e1.log |
    | Linux | (screenshot) | test-backend-ops_all_linux_ff033e1.log |
    | Windows | To be filled | To be filled |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • The table below shows GIFs of the QNN backend running on different platforms:

    | Platform | SoC | Model | GIF | Original video |
    | --- | --- | --- | --- | --- |
    | Android | 8 Gen 2 | llama-3-8B-Instruct-Q4_K_M | Recording_Muted_hevc_14_126_640 | Recording_Muted_hevc.mp4 |
    | Windows | To be filled | | | |

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized models, the CPU backend may be used as a fallback.

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantized data types, with efforts underway to map GGML's block quantization structure into QNN (the Q4_0 block layout sketched below illustrates what needs to be represented).
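
For reference, the sketch below reproduces ggml's Q4_0 block layout (as defined in ggml-common.h): 32 weights share one fp16 scale, and each weight is stored as a 4-bit value packed two per byte. Whether QNN consumes this directly through a custom op or through an unpacked representation is the open design question mentioned above.

```cpp
// ggml's Q4_0 block layout (see ggml-common.h): one fp16 scale per block of
// 32 weights, each weight a 4-bit value packed two per byte.
#include <stdint.h>

#define QK4_0 32

typedef uint16_t ggml_half;  // fp16 storage type used by ggml

typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 16 bytes holding 32 x 4-bit quants
} block_q4_0;

// Reference dequantization: weight[i] = d * (q[i] - 8), where q[i] is the
// unsigned 4-bit value (0..15) unpacked from qs.
```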

zhou.weiguo and others added 30 commits April 24, 2024 16:28
auto old_mode = SetErrorMode(SEM_FAILCRITICALERRORS);
SetErrorMode(old_mode | SEM_FAILCRITICALERRORS);

auto handle = LoadLibraryA(lib_path.c_str()); // TODO: use wstring version for unicode paths
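
One possible way to address the TODO in the last quoted line, sketched here as a suggestion rather than code from this PR: convert the path (assumed to be UTF-8) to UTF-16 and call LoadLibraryW, so paths containing non-ANSI characters load correctly.

```cpp
// Suggested direction for the TODO above (not part of the PR): load through
// the wide-character API so non-ANSI paths work.
#include <string>
#include <windows.h>

static HMODULE load_library_utf8(const std::string & lib_path) {
    // Assumes lib_path is UTF-8 encoded.
    const int len = MultiByteToWideChar(CP_UTF8, 0, lib_path.c_str(), -1, nullptr, 0);
    if (len <= 0) {
        return nullptr;
    }
    std::wstring wide_path(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, lib_path.c_str(), -1, &wide_path[0], len);
    return LoadLibraryW(wide_path.c_str());
}
```
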
@chraac (Author) commented Mar 22, 2025


Hi @slaren, I noticed we have similar dynamic library loading functionality in ggml-backend-reg.cpp (the dl_load_library function) that could be useful in other parts of the codebase.
I suggest moving this to a common utility module so we can reuse it across the project. This would help reduce code duplication and provide a consistent approach to loading libraries.
I'd be happy to prepare another PR about that, WDYT?

A project member replied:


Sorry, I missed this. I think that this code is small enough that it is not really a problem if it is duplicated in a backend, and making it part of the public API available to backends may make it harder to change it in the future. So at the moment my preference would be to avoid this.
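
For context, the functionality under discussion is roughly the following kind of thin wrapper over dlopen/LoadLibrary. This is an illustrative sketch only, not the actual dl_load_library in ggml-backend-reg.cpp nor the QNN backend's loader.

```cpp
// Illustrative sketch of the duplicated functionality: a thin cross-platform
// dynamic-library loader (not the actual ggml or QNN backend code).
#ifdef _WIN32
#    include <windows.h>
using dl_handle = HMODULE;
static dl_handle dl_load(const char * path) {
    return LoadLibraryA(path);
}
static void * dl_sym(dl_handle handle, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name));
}
#else
#    include <dlfcn.h>
using dl_handle = void *;
static dl_handle dl_load(const char * path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}
static void * dl_sym(dl_handle handle, const char * name) {
    return dlsym(handle, name);
}
#endif
```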

chraac added 4 commits April 3, 2025 23:57
* add op define xml

* copy qnn libs in cmake

* fix htp skel path

* add windows copy file list

* wip

* add generated package

* remove unused params

* add cmake list

* set qnn sdk and hexagon sdk path

* wip

* wip

* fix tools version

* fix compiling error

* fix dims calc

* wip

* add mulmat 2d

* wip

* reduction

* wip

* wip

* fix compiling error in x64

* wip

* fix device description in emulator

* wip

* add flag

* copy necessary libs

* wip

* load HtpPrepare first for emulator

* enable custom op for 2d matrix

* verify op config before add to node

* Revert "verify op config before add to node"

This reverts commit 206dec8.

* wip

* wip

* wip

* revert tool version change

* use hexagon sdk version 5.5.0

https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0

* wip

* move to sub dir

* add hexagon npu device and server lib

* fix npu lib build

* refactoring: rename QNNBackend enum

* fix compiling error

* wip

* remove qnn/backend.hpp

* add hexagon dsp host layer

* extract rpc_mem from qnn submodule

* fix dsp compiling error

* wip

* wip

* open and close npu device

* split objects into separated files

* fix linking error

* add npu_tensor

* add host graph

* map rpc buffer before usage

* fix some todos

* add shared module

* split rpc_interface from rpc_mem

* get get_dsp_arch from device

* wip

* rename host classes

* fix hexagon sdk arch getter

* fix device open

* fix linking error

* fix crash

* use tensor_data_type

* fix npu lib crash

* fix debug log print

* skip empty graph

* wip

* add log

* fix unmap fail

* fix tensor set

* remove some logs

* flush back memory after finished

* fix nb

* wip

* wip

* add helper function

* impl add op

* fix some add in test-backend-ops

* add elt wise sub and mul

* fix crash on some inplace op

* wip

* fix elt wise op calc

* wip

* split mul_mat into file

* add caps array

* wip

* wip

* print support/unsupport op

* copy lldb-server for newer android sdk

* add tensor_spec

* add assert

* fix crash when loading model

* rename cmake option

* fix name

* fix device memory and description

* fix compiling error on qnn only build

* fix some potential UBs

* fix comments
chraac added 24 commits April 24, 2025 21:33
* add qurt_thread

* add thread pool

* add thread_pool obj at device ctx

* wip

* small refactoring to fit the thread pool structure

* set start/end threads for add

* init thread pool

* fix thread creation

* split complete and pending signals

* opt mulmat

* wip

* 2 threads

* back to 4 threads

* use barrier

* remove some unnecessary package

* add multi thread support for mul mat

* wip

* use qurt_barrier_t instead of qurt_signal_t

* wip

* wip

* add log

* split qnn cmake config

* create function to calculate the start and end func

* wip

* fix comment

* fix comment

* fix comment

* wip

* fix typo
* add f16 support to elt wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* template trick

* wip

* add f16 mulmat

* add log

* fix view liked op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fallback to memcache when vtcm allocate failed

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to_float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validate cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cache

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip
* wip

* refactor: rewrite dequantize_row_q4_0 by intrinsic

* log for debug

* fix q4 intrinsic

* small opt

* wip

* wip

* add vtcm_quota_size

* add perf log for hexagon-npu backend

* wip

* add log

* sync after a specific op

* increase worker thread priority

* fix unbalanced thread slice

* small slice to fit in vtcm cache

* limit the supported row element size

* opt 4_0 dequant

* fix q4 dequant

* add power_utils

* add rms_norm

* wip

* enable rms_norm f32

* fix rms_norm with param

* fix compiling flags

* use float

* fix small row size

* vectorized rms norm

* wip

* read 2 vectors

* rename

* add perf log on update

* set empty tensors handle also

* merge some rpc functions

* opt param update

* wip

* print more log

* add struct for update param config

* add npu_device_graph_set_tensor_with_param

* merge tensor and params update

* wip

* wip

* make as template to reuse

* vectorize dequantize_row_q8_0

* opt

* avoid using union to store q data

* wip

* wip

* wip
* add flash attn op

* expand src tensor size

* add flash attn sources

* add quantize row functions

* make a separated file for vec_dot

* wip

* wip

* refactor: rename quants.hpp includes and add vec_dot to type traits

* add flash_attn impl

* split vec_scale_f32

* move vec_reduction_qf32 to vec_ops

* add vec_scale_f16

* opt

* add vec_mad

* implement vec_mad_f16

* opt

* add op template

* opt

* add align version

* enable flash attn

* wip

* log print improve

* add profiler log

* wip

* wip

* add multi sub proc perf tracker

* increase log buffer

* remove sub proc pcycle

* wip

* wip

* add prefetch for vec_dot

* wip

* wip

* opt f16 vec dot

* opt f16 vecdot

* reuse vec_dot_product_impl in vec dot f32

* small opt to unblock pipeline

* opt on aligned address

wip

* Revert "opt on aligned address"

This reverts commit 27be1eb.

* add profiler log at thread_pool

* wip

* invalidate all...

* Reapply "opt on aligned address"

This reverts commit f075a4c.

* add is_constant for tensor config

* disable align tensor opt in mul_mat

* wip

* wip

* vec_scale_impl: unrolling the loop

* wip

* wip

* replace reinterpret_cast with direct pointer access for write/read buffers

* add fetch

* wip

* wip

* wip

* add log

* check tensor shape at flash_attn

* wip

* wip

* fix: update tensor type handling in flash_attn_impl

* wip

* fix: align cache size

* fix: qf16->hf

* fix: swap order of elements in vector combine for correct scaling

* fix: opt f16 scale and mad

* fix leftover fetch

* wip

* load into vector pair

* opt cache size calculation in flash_attn_impl

* refactoring: hold vtcm at thread local object

* wip

* add profiler log

* mark tensors as modified

* restrict tensor invalidation to the first thread in compute_impl

* Revert "restrict tensor invalidation to the first thread in compute_impl"

This reverts commit 0a8ff2b.

* invalidate last tensor in compute_impl

* invalidate last tensor in compute function

* wip

* refactor dequantize_row_q4_0 to simplify vector alignment

* wip

* refactoring: move VTCM quota calculation to thread pool

* wip

* fix: correct condition check for HEXAGON_SDK_ROOT existence

* wip

* wip

* wip

* wip

* fix: update condition checks match the naming

* fix: improve tensor handling checks and logging in graph and operation implementations

* wip
* feat: add mixed precision dot product implementation and function declaration

* feat: implement mixed precision vector dot product and conversion functions

* fix: update data type handling in matrix multiplication implementation

* fix: adjust row count handling in matrix multiplication implementation for accurate slicing

* fix: optimize matrix multiplication implementation by unroll loop

* update performance tracking for matrix multiplication implementation

* add fetching

* wip

* fix: support F16 * F32 multiplication in is_mul_mat_supported function

* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling

* fix test failure for row width 67

* try fix failed test

* fix: rename aligned_address to align_down for clarity in vector alignment handling

* wip

* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility

* fix test failure at width == 193

* fix: replace zero vector initialization with previous vector in mixed dot product implementation

* wip

* fix: improve handling of last vector in mixed dot product implementation

* wip

* wip

* wip

* wip

* Enhance mul_mat_f32 function to support quantized types and improve static assertions

* rename

* Refactor dequantization functions to use npu_device_fp16_t and improve type handling

* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16

* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16

* Add hvx_vsf_convert_vhf function for improved vector conversion

* add perf logs

* Refactor dequantize_row_q4_0 for alignment

* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity

* Add support for ROPE operation in NPU capabilities and related functions

* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations

* enable ROPE by adding operation validation

* add support to freq is null case

* wip

* Refactor rope_f32 to improve indexing by introducing total_planes calculation

* reformat

* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers

* Add performance tracking to rope_f32 function for enhanced profiling

* Refactor rope_f32 to use a templated implementation

* Refactor rope_impl to replace loop with memcpy for improved performance

* Refactor mul_mat_impl to support quantization as a template parameter

* wip

* wip

* Refactor rope_impl to optimize plane indexing in the processing loop

* Add aligned vector dot product implementation for mixed precision types

* wip

* Enhance matrix multiplication for F32 and F16 types with alignment checks

* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums

* Add alignment checks for matrix multiplication and vector dot products

* Refactor matrix multiplication to use function pointers for improved readability and maintainability

* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling

* Remove unused f16_to_f32_table parameter from quantization and dequantization functions

* wip

* Add L2 fetch for src1 plane rows in matrix multiplication implementation

* wip

* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication

* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity

* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance

* Refactor vector operation functions to improve clarity and consistency in variable usage

* wip

* wip

* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations

* wip

* Update load_dual_block_generic to use intrinsics

* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity

* wip

* wip

* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop

* wip

* wip

* fix typo
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
# Conflicts:
#	ggml/CMakeLists.txt
Labels
build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning)

9 participants